Audio-Visual Contrastive Learning with Temporal Self-Supervision

Authors

Abstract

We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images, which capture static scene appearance, videos also contain sound and temporal dynamics. To leverage the aural and temporal dimensions inherent to videos, our method extends temporal self-supervision to the audio-visual setting and integrates it with multi-modal contrastive objectives. As temporal self-supervision, we pose playback speed and direction recognition in both modalities, along with intra- and inter-modal temporal ordering tasks. Furthermore, we design a novel contrastive objective in which the usual pairs are supplemented with additional sample-dependent positives and negatives sampled from the evolving feature space. In our model, we apply such losses among video clips and between video clips and their temporally corresponding audio clips. We verify our model with extensive ablation experiments and evaluate transfer to action retrieval on UCF101 and HMDB51, audio classification on ESC50, and robust video fingerprinting on VGG-Sound, achieving state-of-the-art results.
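The inter-modal contrastive objective described in the abstract can be illustrated as a standard symmetric InfoNCE loss between temporally aligned video and audio clip embeddings. This is a minimal sketch only: the paper's actual objective additionally mines sample-dependent positives and negatives from the evolving feature space, which is omitted here, and the function and tensor names are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def av_infonce(video_emb: torch.Tensor, audio_emb: torch.Tensor,
               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between temporally aligned video/audio embeddings.

    video_emb, audio_emb: (N, D) tensors where row i of each tensor comes
    from the same clip, so the diagonal of the similarity matrix holds the
    positive pairs and all off-diagonal entries serve as negatives.
    """
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature                  # (N, N) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetrize over the video->audio and audio->video retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for encoder outputs.
video_emb = torch.randn(8, 128)
audio_emb = torch.randn(8, 128)
loss = av_infonce(video_emb, audio_emb)
```

The same loss form can also be applied among video clips of one video (intra-modal), which is how the abstract's "losses among video clips" would slot in under this sketch.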


Related articles

Learning Unsupervised Visual Grounding Through Semantic Self-Supervision

Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, the lack of supervisory signals exacerbates this difficulty. In this paper, we propose a novel framework for unsupervised visual grounding which uses concept learning as a proxy task to obtain self-supervision. The simple int...


Self-Supervision for Reinforcement Learning

Reinforcement learning optimizes policies for expected cumulative reward. Need the supervision be so narrow? Reward is delayed and sparse for many tasks, making it a difficult and impoverished signal for end-to-end optimization. To augment reward, we consider a range of self-supervised tasks that incorporate states, actions, and successors to provide auxiliary losses. These losses offer ubiquit...


Comparing the Impact of Audio-Visual Input Enhancement on Collocation Learning in Traditional and Mobile Learning Contexts

This study investigated the impact of audio-visual input enhancement teaching techniques on improving English as a Foreign Language (EFL) learners' collocation learning as well as their accuracy concerning collocation use in narrative writing. In addition, it compared the impact and efficiency of audio-visual input enhancement in two learning contexts, namely traditional and mo...


Leveraging Inexpensive Supervision Signals for Visual Learning

The success of deep learning based methods for computer vision comes at a cost. Most deep neural network models require a large corpus of annotated data for supervision. The process of obtaining such data is often time consuming and expensive. For example, the process of collecting bounding box annotations takes 26-42 seconds per box. This requirement poses a hindrance for extending these metho...


Ambient Sound Provides Supervision for Visual Learning

The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, thro...



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i7.25967